To be fair, I'm not a huge fan of statistics and computer science. I work as a doctor and am doing my PhD on the side. I took a similar course more than two years ago and have forgotten most of it by now. There aren't many courses available fully remotely, so I'm happy to have found this one on Sisu. I hope to refresh what I have learned in the past and to pick up new skills that will let me analyse data for my research.
The exercise set was a good way to start getting familiar with the course content and to check whether everything works as it should. The most difficult part for me is organizing big data sets with many variables, as in the last section of the exercise. The book gives a good introduction to the basic commands in RStudio, and nowadays it is also easy to find help simply by googling an issue. So far I have liked the instructions on the Moodle platform and everything seems to be working fine. Looking forward to the real tasks.
https://github.com/konradsopyllo/IODS-project.git
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Wed Dec 6 17:47:09 2023"
The text continues here.
Konrad Sopyllo
The data presented below is based on a survey collected between 3.12.2014 and 10.1.2015 in Florence, Italy. The questionnaire was translated into Finnish by Kimmo Vehkalahti and Liisa Myyry. The survey was originally based on a Finnish study on progress and satisfaction in the two-phase study programme of technical universities.
# Data wrangling
#In this data wrangling part we briefly create the data set that is saved as the csv file in the data folder.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt"
data <- read_tsv(url)
## Rows: 183 Columns: 60
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): gender
## dbl (59): Aa, Ab, Ac, Ad, Ae, Af, ST01, SU02, D03, ST04, SU05, D06, D07, SU0...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Creating an analysis_dataset with the variables gender, age, attitude, deep, stra, surf and points by combining values from the learning2014 data. Note that stra, surf and deep are each combined from multiple variables.
columns <- c("gender", "Age", "Attitude", "Points")
analysis_dataset <- data[columns]
Other variables explained:
- d_sm: Seeking Meaning
- d_ri: Relating Ideas
- d_ue: Use of Evidence
- su_lp: Lack of Purpose
- su_um: Unrelated Memorising
- su_sb: Syllabus-boundness
- st_os: Organized Studying
- st_tm: Time Management
- Deep: Deep approach
- Surf: Surface approach
- Stra: Strategic approach
#Stra = st_os + st_tm = ST01+ST09+ST17+ST25+ST04+ST12+ST20+ST28
stra_columns <- unlist(str_split("ST01+ST09+ST17+ST25+ST04+ST12+ST20+ST28", "\\+"))
#surf = su_lp + su_um + su_sb = SU02+SU10+SU18+SU26 + SU05+SU13+SU21+SU29 + SU08+SU16+SU24+SU32
surf_columns <- unlist(str_split("SU02+SU10+SU18+SU26+SU05+SU13+SU21+SU29+SU08+SU16+SU24+SU32", "\\+"))
#deep = d_sm + d_ri + d_ue = D03+D11+D19+D27 + D07+D14+D22+D30 + D06+D15+D23+D31
deep_columns <- unlist(str_split("D03+D11+D19+D27+D07+D14+D22+D30+D06+D15+D23+D31", "\\+"))
#Adding stra, deep and surf as scaled variables using the rowMeans function.
analysis_dataset["stra"] <- rowMeans(select(data, all_of(stra_columns)))
analysis_dataset["deep"] <- rowMeans(select(data, all_of(deep_columns)))
analysis_dataset["surf"] <- rowMeans(select(data, all_of(surf_columns)))
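As a quick illustration of the scaling step, here is a minimal sketch (with made-up item values, not the real survey data) showing that rowMeans() brings the Likert items back onto the original 1-5 scale:

```r
# Synthetic stand-in for four strategic-learning items (1-5 Likert scale)
toy <- data.frame(
  ST01 = c(1, 5), ST09 = c(3, 5), ST17 = c(2, 4), ST25 = c(2, 2)
)
# Averaging the items per respondent keeps the result on the 1-5 scale
toy$stra <- rowMeans(toy[, c("ST01", "ST09", "ST17", "ST25")])
toy$stra  # 2 for the first respondent, 4 for the second
```

This is why the stra, deep and surf columns below range between 1 and 5 even though each combines many items.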
#Filtering out rows where the Points variable equals 0
analysis_dataset <- analysis_dataset[analysis_dataset$Points != 0, ]
dim(analysis_dataset)
## [1] 166 7
#As in the exercise, the file has 166 observations and 7 variables
#Saving analysis_dataset as a csv file
library(readr)
write_csv(analysis_dataset, "data/learning2014.csv")
#Showing how to read the csv-file
loaded_data <- read_csv("data/learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): Age, Attitude, Points, stra, deep, surf
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(loaded_data)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ Age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num [1:166] 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. Age = col_double(),
## .. Attitude = col_double(),
## .. Points = col_double(),
## .. stra = col_double(),
## .. deep = col_double(),
## .. surf = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(loaded_data)
## # A tibble: 6 × 7
## gender Age Attitude Points stra deep surf
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 53 37 25 3.38 3.58 2.58
## 2 M 55 31 12 2.75 2.92 3.17
## 3 F 49 25 24 3.62 3.5 2.25
## 4 M 53 35 10 3.12 3.5 2.25
## 5 M 49 37 22 3.62 3.67 2.83
## 6 F 38 38 21 3.62 4.75 2.42
## Data analysis
In this section we focus further on analysing the data itself. A brief description of the contents of the data was presented at the beginning of the file.
#Loading the libraries needed for the correlation analysis
library("Hmisc")
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library("corrplot")
## corrplot 0.92 loaded
library(ggplot2)
#Gender distribution
gender_graph <- ggplot(data.frame(analysis_dataset), aes(x = gender)) +
geom_bar(fill = "lightgreen")
print(gender_graph)
#Points distribution
points_dist <- ggplot(data.frame(analysis_dataset), aes(x = Points)) +
geom_density(fill = "lightgreen")
print(points_dist)
#Age distribution
age_dist <- ggplot(data.frame(analysis_dataset), aes(x = Age)) +
geom_density(fill = "lightgreen")
print(age_dist)
#deep distribution
deep_dist <- ggplot(data.frame(analysis_dataset), aes(x = deep)) +
geom_density(fill = "lightgreen")
print(deep_dist)
#surf distribution
surf_dist <- ggplot(data.frame(analysis_dataset), aes(x = surf)) +
geom_density(fill = "lightgreen")
print(surf_dist)
#stra distribution
stra_dist <- ggplot(data.frame(analysis_dataset), aes(x = stra)) +
geom_density(fill = "lightgreen")
print(stra_dist)
#Creating a correlation plot of the variables "Attitude", "deep", "stra", "surf" and "Points" using Pearson correlation
correlations <- c("Attitude", "deep", "stra", "surf", "Points")
analysis_cor <- cor(analysis_dataset[correlations])
corrplot(analysis_cor)
plot(analysis_dataset)
#Looking at the correlation matrix we can identify the variable pairs with the strongest correlations. These are examined further below.
## Attitude vs Points
p1 <- ggplot(data.frame(analysis_dataset), aes(x = Attitude, y = Points))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3+ggtitle("Attitude vs Points")
print(p4)
## `geom_smooth()` using formula = 'y ~ x'
model <- lm(analysis_dataset$Points~analysis_dataset$Attitude)
summary(model)
##
## Call:
## lm(formula = analysis_dataset$Points ~ analysis_dataset$Attitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## analysis_dataset$Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
#The p-value is the lowest of the three pairwise models examined here.
#Another way of doing the same thing:
library(ggplot2)
qplot(Attitude, Points, data = analysis_dataset) + geom_smooth(method = "lm")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
my_model <- lm(Points ~ Attitude, data = analysis_dataset)
#surf vs deep
p1 <- ggplot(data.frame(analysis_dataset), aes(x = deep, y = surf))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3+ggtitle("deep vs surf")
print(p4)
## `geom_smooth()` using formula = 'y ~ x'
model <- lm(analysis_dataset$surf~analysis_dataset$deep)
summary(model)
##
## Call:
## lm(formula = analysis_dataset$surf ~ analysis_dataset$deep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.60164 -0.34081 0.01032 0.30539 1.12548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.92426 0.26235 14.958 < 2e-16 ***
## analysis_dataset$deep -0.30902 0.07051 -4.383 2.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5019 on 164 degrees of freedom
## Multiple R-squared: 0.1048, Adjusted R-squared: 0.09939
## F-statistic: 19.21 on 1 and 164 DF, p-value: 2.085e-05
#The second lowest p-value of the three comparisons.
#Attitude vs surf
p1 <- ggplot(data.frame(analysis_dataset), aes(x = Attitude, y = surf))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3+ggtitle("Attitude vs surf")
print(p4)
## `geom_smooth()` using formula = 'y ~ x'
model <- lm(analysis_dataset$surf~analysis_dataset$Attitude)
summary(model)
##
## Call:
## lm(formula = analysis_dataset$surf ~ analysis_dataset$Attitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08644 -0.40136 -0.01364 0.35040 1.50259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.18686 0.17968 17.737 <2e-16 ***
## analysis_dataset$Attitude -0.01272 0.00557 -2.283 0.0237 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5222 on 164 degrees of freedom
## Multiple R-squared: 0.03082, Adjusted R-squared: 0.02491
## F-statistic: 5.214 on 1 and 164 DF, p-value: 0.02368
#The third lowest p-value of the three comparisons. The three models above (Attitude vs surf, surf vs deep, Attitude vs Points) all show statistically significant coefficients. In the linear model, Attitude (predictor variable) vs Points (response variable) shows the strongest association.
#Creating a multiple variable model
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
ggpairs(analysis_dataset, lower = list(combo = wrap("facethist", bins = 20)))
my_model2 <- lm(Points ~ Attitude + stra, data = analysis_dataset)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = analysis_dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
#Coefficients of the linear model with multiple variables (fitted above)
lm(my_model2)
##
## Call:
## lm(formula = my_model2)
##
## Coefficients:
## (Intercept) Attitude stra
## 8.9729 0.3466 0.9137
par(mfrow = c(2, 2))
#Residuals vs Fitted values (which = 1)
plot(my_model2, which = 1)
#Normal QQ-plot (which = 2)
plot(my_model2, which = 2)
#Residuals vs Leverage (which = 5)
plot(my_model2, which = 5)
#The same three diagnostic plots produced in one call
plot(my_model2, which = c(1, 2, 5))
In the Residuals vs Fitted plot the points are fairly randomly scattered
around the line, which suggests the constant-variance assumption is
reasonable. In the QQ plot there is some deviation at the tails,
indicating a departure from normality. The Residuals vs Leverage plot,
on the other hand, shows an uneven distribution, pointing to a few
observations with higher leverage.
#Making predictions
m <- lm(Points ~ Attitude, data = analysis_dataset)
summary(m)
##
## Call:
## lm(formula = Points ~ Attitude, data = analysis_dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
As we can see above, the adjusted R-squared is 0.1856, so Attitude explains less than a fifth of the variance in Points. This is not a sign of a good model.
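The adjusted value can be reproduced from the printed multiple R-squared. A small check, using the rounded figures from the summary above (n = 166 observations, one predictor):

```r
# Adjusted R-squared from R-squared: penalizes for the number of predictors
r2 <- 0.1906   # multiple R-squared from the summary
n  <- 166      # observations
p  <- 1        # predictors (Attitude only)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # close to the 0.1856 reported above (inputs are rounded)
```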
#New observations
new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(Attitude = new_attitudes)
print(new_data)
## Attitude
## Mia 3.8
## Mike 4.4
## Riikka 2.2
## Pekka 2.9
#Predicting exam points based on the model created above.
predictions <- predict(m, newdata = new_data)
print(predictions)
## Mia Mike Riikka Pekka
## 12.97683 13.18836 12.41275 12.65954
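For a single-predictor linear model, predict() is just the fitted line evaluated at the new value. A quick sanity check using the rounded coefficients printed in the summary above:

```r
# Prediction by hand: intercept + slope * new Attitude value (Mia's 3.8)
b0 <- 11.63715   # intercept from the summary above
b1 <- 0.35255    # Attitude coefficient from the summary above
mia <- b0 + b1 * 3.8
round(mia, 5)    # close to the 12.97683 predicted for Mia
```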
Here we go again…
Konrad Sopyllo
The data presented below concerns student performance in Portuguese secondary education. It focuses mainly on performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). The data was collected by Cortez and Silva in 2008.
url <- "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"
alc_data <- read.csv(url)
variable_names <- names(alc_data)
cat("Variable Names:", paste(variable_names, collapse = ", "), "\n\n")
## Variable Names: school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, schoolsup, famsup, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, failures, paid, absences, G1, G2, G3, alc_use, high_use
In this section we choose four factors and analyse their correlation with alcohol consumption. For each we first state a hypothesis on the result and then test it. In addition, we interpret the coefficients of the models and their odds ratios.
#Hypothesis 1. Being in a romantic relationship is correlated with higher alcohol use.
library(tidyverse)
gathered_data <- alc_data %>%
gather(key = "variable", value = "alcohol_consumption", Dalc, Walc)
ggplot(gathered_data, aes(x = romantic, fill = variable, y = alcohol_consumption)) +
geom_bar(position = "dodge", stat = "summary", fun = "mean", color = "black", alpha = 0.7) +
labs(title = "Average Alcohol Consumption vs Romantic relationship",
x = "romantic",
y = "Average Alcohol Consumption") +
scale_fill_manual(values = c("red", "green"), name = "Alcohol Consumption",
labels = c("Workday", "Weekend")) +
theme_minimal()
logistic_model <- glm(high_use ~ Dalc + Walc + romantic,
data = alc_data,
family = binomial,
maxit = 1000)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
##
## Call:
## glm(formula = high_use ~ Dalc + Walc + romantic, family = binomial,
## data = alc_data, maxit = 1000)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.432e+02 4.161e+05 -0.001 1
## Dalc 5.423e+01 1.122e+05 0.000 1
## Walc 5.397e+01 9.790e+04 0.001 1
## romanticyes -2.844e-01 9.737e+04 0.000 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.5204e+02 on 369 degrees of freedom
## Residual deviance: 3.4622e-10 on 366 degrees of freedom
## AIC: 8
##
## Number of Fisher Scoring iterations: 29
odds_ratios <- exp(coef(logistic_model))
conf_intervals <- exp(confint(logistic_model))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
cat("\nOdds Ratios and 95% Confidence Intervals:\n")
##
## Odds Ratios and 95% Confidence Intervals:
data.frame(Odds_Ratio = odds_ratios,
Lower_CI = conf_intervals[, 1],
Upper_CI = conf_intervals[, 2])
## Odds_Ratio Lower_CI Upper_CI
## (Intercept) 2.333493e-106 0 0
## Dalc 3.569417e+23 0 Inf
## Walc 2.755430e+23 0 Inf
## romanticyes 7.524331e-01 0 NA
According to the model, being in a romantic relationship is associated with lower odds of high alcohol use (odds ratio about 0.75). However, the model includes Dalc and Walc, from which high_use is derived, so the standard errors explode and every p-value equals 1; this estimate should not be trusted.
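The glm.fit warnings above most likely arise from perfect separation: because high_use is computed from Dalc and Walc, including them as predictors lets the model classify every case exactly, and the coefficients diverge. A minimal synthetic reproduction of the phenomenon (made-up data, not the student survey):

```r
set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 10))  # two well-separated groups
y <- rep(c(0, 1), each = 20)                       # outcome fully determined by x
fit <- suppressWarnings(glm(y ~ x, family = binomial))
# The slope blows up and the Wald p-values are near 1, just as in the
# summaries above - a sign the estimates cannot be interpreted.
coef(fit)["x"]
```

Dropping Dalc and Walc from the predictors would avoid this circularity.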
#Hypothesis 2. More activities are related to higher alcohol use.
gathered_data <- alc_data %>%
gather(key = "variable", value = "alcohol_consumption", Dalc, Walc)
ggplot(gathered_data, aes(x = activities, fill = variable, y = alcohol_consumption)) +
geom_bar(position = "dodge", stat = "summary", fun = "mean", color = "black", alpha = 0.7) +
labs(title = "Average Alcohol Consumption vs Activities",
x = "Activities",
y = "Average Alcohol Consumption") +
scale_fill_manual(values = c("red", "green"), name = "Alcohol Consumption",
labels = c("Workday", "Weekend")) +
theme_minimal()
#regression analysis
logistic_model <- glm(high_use ~ Dalc + Walc + activities,
data = alc_data,
family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
##
## Call:
## glm(formula = high_use ~ Dalc + Walc + activities, family = binomial,
## data = alc_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -207.2808 56619.7972 -0.004 0.997
## Dalc 46.2209 15226.4671 0.003 0.998
## Walc 45.9976 13263.2386 0.003 0.997
## activitiesyes -0.1948 12574.1620 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.5204e+02 on 369 degrees of freedom
## Residual deviance: 1.8847e-08 on 366 degrees of freedom
## AIC: 8
##
## Number of Fisher Scoring iterations: 25
odds_ratios <- exp(coef(logistic_model))
conf_intervals <- exp(confint(logistic_model))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cat("\nOdds Ratios and 95% Confidence Intervals:\n")
##
## Odds Ratios and 95% Confidence Intervals:
data.frame(Odds_Ratio = odds_ratios,
Lower_CI = conf_intervals[, 1],
Upper_CI = conf_intervals[, 2])
## Odds_Ratio Lower_CI Upper_CI
## (Intercept) 9.530179e-91 0.000000e+00 0.000000e+00
## Dalc 1.184331e+20 1.865684e-115 7.518102e+154
## Walc 9.473186e+19 Inf Inf
## activitiesyes 8.230000e-01 0.000000e+00 NA
The odds ratio for activities (0.82) points to slightly lower odds of high alcohol use, contrary to the hypothesis, but the p-value (1.000) is far from significant; as above, the model separates the data perfectly because of Dalc and Walc, so no reliable conclusion can be drawn.
#Hypothesis 3. Amount of going out is correlated with higher alcohol use
gathered_data <- alc_data %>%
gather(key = "variable", value = "alcohol_consumption", Dalc, Walc)
ggplot(gathered_data, aes(x = goout, fill = variable, y = alcohol_consumption)) +
geom_bar(position = "dodge", stat = "summary", fun = "mean", color = "black", alpha = 0.7) +
labs(title = "Average Alcohol Consumption vs Going out",
x = "Going out",
y = "Average Alcohol Consumption") +
scale_fill_manual(values = c("red", "green"), name = "Alcohol Consumption",
labels = c("Workday", "Weekend")) +
theme_minimal()
logistic_model <- glm(high_use ~ Dalc + Walc + goout,
data = alc_data,
family = binomial,
maxit = 1000)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
##
## Call:
## glm(formula = high_use ~ Dalc + Walc + goout, family = binomial,
## data = alc_data, maxit = 1000)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.434e+02 4.263e+05 -0.001 1
## Dalc 5.424e+01 1.121e+05 0.000 1
## Walc 5.399e+01 9.967e+04 0.001 1
## goout -1.091e-03 4.686e+04 0.000 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.5204e+02 on 369 degrees of freedom
## Residual deviance: 3.4654e-10 on 366 degrees of freedom
## AIC: 8
##
## Number of Fisher Scoring iterations: 29
odds_ratios <- exp(coef(logistic_model))
conf_intervals <- exp(confint(logistic_model))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cat("\nOdds Ratios and 95% Confidence Intervals:\n")
##
## Odds Ratios and 95% Confidence Intervals:
data.frame(Odds_Ratio = odds_ratios,
Lower_CI = conf_intervals[, 1],
Upper_CI = conf_intervals[, 2])
## Odds_Ratio Lower_CI Upper_CI
## (Intercept) 1.980652e-106 0 0
## Dalc 3.582535e+23 0 Inf
## Walc 2.813755e+23 0 Inf
## goout 9.989097e-01 NA Inf
The bar plot suggests that students who go out more also drink more. In the logistic model, however, the goout coefficient is essentially zero (odds ratio 0.999, p-value 1), again because Dalc and Walc dominate the fit; the hypothesis cannot be tested reliably with this model.
#Hypothesis 4. Increased internet usage is correlated with lower alcohol consumption.
gathered_data <- alc_data %>%
gather(key = "variable", value = "alcohol_consumption", Dalc, Walc)
ggplot(gathered_data, aes(x = internet, fill = variable, y = alcohol_consumption)) +
geom_bar(position = "dodge", stat = "summary", fun = "mean", color = "black", alpha = 0.7) +
labs(title = "Average Alcohol Consumption vs Internet Use",
x = "Internet Use",
y = "Average Alcohol Consumption") +
scale_fill_manual(values = c("red", "green"), name = "Alcohol Consumption",
labels = c("Workday", "Weekend")) +
theme_minimal()
logistic_model <- glm(high_use ~ Dalc + Walc + internet,
data = alc_data,
family = binomial,
maxit = 1000)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
##
## Call:
## glm(formula = high_use ~ Dalc + Walc + internet, family = binomial,
## data = alc_data, maxit = 1000)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -243.506 429674.814 -0.001 1
## Dalc 54.231 112021.326 0.000 1
## Walc 53.991 97780.360 0.001 1
## internetyes 0.148 131720.810 0.000 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.5204e+02 on 369 degrees of freedom
## Residual deviance: 3.4659e-10 on 366 degrees of freedom
## AIC: 8
##
## Number of Fisher Scoring iterations: 29
odds_ratios <- exp(coef(logistic_model))
conf_intervals <- exp(confint(logistic_model))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
cat("\nOdds Ratios and 95% Confidence Intervals:\n")
##
## Odds Ratios and 95% Confidence Intervals:
data.frame(Odds_Ratio = odds_ratios,
Lower_CI = conf_intervals[, 1],
Upper_CI = conf_intervals[, 2])
## Odds_Ratio Lower_CI Upper_CI
## (Intercept) 1.764963e-106 0 0
## Dalc 3.566017e+23 0 Inf
## Walc 2.804818e+23 Inf Inf
## internetyes 1.159493e+00 NA Inf
The odds ratio for internet use (1.16) points in the opposite direction to the hypothesis, but with a p-value of 1 the estimate is again unreliable because of the perfect separation caused by Dalc and Walc.
#In the last section we compare the predictive power of the model with simple guessing.
predictions <- predict(logistic_model, type = "response") > 0.5
conf_matrix <- table(Actual = alc_data$high_use, Predicted = predictions)
print(conf_matrix)
## Predicted
## Actual FALSE TRUE
## FALSE 259 0
## TRUE 0 111
training_error <- sum(alc_data$high_use != predictions) / nrow(alc_data)
cat("\nTraining Error:", training_error, "\n")
##
## Training Error: 0
library(ggplot2)
ggplot(alc_data, aes(x = as.factor(high_use), fill = as.factor(predictions))) +
geom_bar(position = "dodge") +
labs(title = "Actual vs. Predicted Values",
x = "Actual Values",
y = "Count",
fill = "Predicted Values") +
theme_minimal()
majority_class <- as.numeric(names(sort(table(alc_data$high_use), decreasing = TRUE)[1]))
## Warning: NAs introduced by coercion
simple_guess_predictions <- rep(majority_class, nrow(alc_data))
model_accuracy <- sum(alc_data$high_use == predictions) / nrow(alc_data)
simple_guess_accuracy <- sum(alc_data$high_use == simple_guess_predictions) / nrow(alc_data)
cat("Model Accuracy:", model_accuracy, "\n")
## Model Accuracy: 1
cat("Simple Guessing Accuracy:", simple_guess_accuracy, "\n")
## Simple Guessing Accuracy: NA
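The NA above comes from `as.numeric()` applied to the table names "FALSE"/"TRUE", which cannot be parsed as numbers. A small sketch of a majority-class baseline that avoids this, using hypothetical stand-in data with the same class counts as the confusion matrix (259 FALSE, 111 TRUE):

```r
# Sketch: majority-class baseline without the NA warning.
# 'actual' is a hypothetical stand-in for alc_data$high_use.
actual <- factor(c(rep(FALSE, 259), rep(TRUE, 111)))

# Keep the majority class as a character label instead of coercing to numeric
majority_class <- names(which.max(table(actual)))
simple_guess <- rep(majority_class, length(actual))

baseline_accuracy <- mean(as.character(actual) == simple_guess)
baseline_accuracy  # 259/370, about 0.70
```

Guessing the majority class alone would thus already reach about 70% accuracy, which is the right benchmark to compare the model against.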
The model classifies every individual correctly: 100% accuracy, a training error of 0 and no misclassified individuals. This is expected rather than impressive, since high_use is defined from the same alcohol consumption variables (Dalc, Walc) that are used as predictors.
#Bonus I
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
alc_data$high_use <- as.factor(alc_data$high_use)
set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)
cv_model <- train(high_use ~ Dalc + Walc + romantic, data = alc_data, method = "glm", family = binomial, trControl = ctrl)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(cv_model)
## Generalized Linear Model
##
## 370 samples
## 3 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 333, 333, 333, 333, 333, 333, ...
## Resampling results:
##
## Accuracy Kappa
## 1 1
previous_model_error <- 0.26
cv_model_error <- 1 - cv_model$results$Accuracy
cat("\nPrevious Model Error:", previous_model_error, "\n")
##
## Previous Model Error: 0.26
cat("Cross-Validation Model Error:", mean(cv_model_error), "\n")
## Cross-Validation Model Error: 0
Bonus I: performing a 10-fold cross-validation and checking the test performance.
In the first section, the code creates the training control. Since both Accuracy and Kappa are 1, the model shows perfect classification and perfect agreement beyond chance, respectively. It also performs better than the previous model, since the error is 0 — again a consequence of high_use being derived from Dalc and Walc.
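For comparison, cross-validation can also be done by hand without caret. A minimal sketch on synthetic data (stand-in for alc_data, so the accuracy will not be 1 here):

```r
# Sketch: manual 10-fold cross-validation for a logistic model.
# Synthetic data d is a hypothetical stand-in for the course data.
set.seed(123)
n <- 370
d <- data.frame(x = rnorm(n))
d$y <- factor(rbinom(n, 1, plogis(d$x)))

folds <- sample(rep(1:10, length.out = n))  # random fold assignment

acc <- sapply(1:10, function(k) {
  fit <- glm(y ~ x, data = d[folds != k, ], family = binomial)
  p <- predict(fit, newdata = d[folds == k, ], type = "response")
  mean((p > 0.5) == (d$y[folds == k] == "1"))  # held-out accuracy
})

mean(acc)      # cross-validated accuracy
1 - mean(acc)  # cross-validated error
</imports>
```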
#Bonus II
Konrad Sopyllo
The dataset used for this week's assignment describes housing values in the suburbs of Boston. The set is shown below. The variables are as follows:

- crim = per capita crime rate by town
- zn = proportion of residential land zoned for lots over 25,000 sq.ft.
- indus = proportion of non-retail business acres per town
- chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nox = nitrogen oxides concentration (parts per 10 million)
- rm = average number of rooms per dwelling
- age = proportion of owner-occupied units built prior to 1940
- dis = weighted mean of distances to five Boston employment centres
- rad = index of accessibility to radial highways
- tax = full-value property-tax rate per $10,000
- ptratio = pupil-teacher ratio by town
- black = 1000(Bk − 0.63)^2 where Bk is the proportion of blacks by town
- lstat = lower status of the population (percent)
- medv = median value of owner-occupied homes in $1000s
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Below is a graphical overview of the data. The dataset is commonly used to predict the median value of owner-occupied homes (medv).
par(mar = c(2, 2, 1, 1))
par(mfrow = c(4, 4))
for (i in 1:14) {
plot(Boston[, i], main = names(Boston)[i], col = "lightblue", pch = 20)
}
par(mfrow = c(1, 1))
par(mar = c(5, 4, 4, 2) + 0.1)
#standardizing the data set and creating a training set.
scaled_data <- as.data.frame(scale(Boston))
summary(scaled_data)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
scaled_data$crime_category <- cut(scaled_data$crim, breaks = quantile(scaled_data$crim), labels = FALSE, include.lowest = TRUE) # include.lowest keeps the minimum crim value from becoming NA
scaled_data <- scaled_data[, -which(colnames(scaled_data) == "crim")]
set.seed(123)
train_indices <- sample(1:nrow(scaled_data), 0.8 * nrow(scaled_data))
train_set <- scaled_data[train_indices, ]
test_set <- scaled_data[-train_indices, ]
The mean of each variable is now 0 and the standard deviation is 1.
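The claim is quick to verify directly; `scale()` centers each column to mean 0 and divides by the sample standard deviation:

```r
# Sketch: checking that scale() standardized every column of Boston.
library(MASS)

scaled <- as.data.frame(scale(Boston))

round(colMeans(scaled), 10)      # all effectively zero
round(apply(scaled, 2, sd), 10)  # all exactly one
```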
Linear discriminant analysis
library(ggplot2)
lda_model <- lda(crime_category ~ ., data = train_set)
lda_data <- data.frame(predict(lda_model, train_set)$x, crime_category = train_set$crime_category)
ggplot(lda_data, aes(x = LD1, y = LD2, color = crime_category)) +
geom_point() +
ggtitle("LDA Biplot") +
xlab("LD1") +
ylab("LD2") +
theme_minimal()
In the plot shown above, the x and y axes represent the linear discriminant functions, which are linear combinations of the original variables. LD1, the first linear discriminant function, maximizes the separation between the categories; LD2, which is orthogonal to LD1, captures additional variation not explained by LD1. In the plot we can clearly see that the highest crime category (mainly 4) is separated from the rest, meaning that high-crime areas can be distinguished from the others.
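How much of the between-group separation each discriminant carries can be read from the singular values of the fitted model. A minimal sketch on synthetic three-class data (the course's lda_model would work the same way):

```r
# Sketch: proportion of discriminatory power per linear discriminant.
# Synthetic data d with three classes separated along x1 (hypothetical).
library(MASS)
set.seed(1)
d <- data.frame(g  = factor(rep(1:3, each = 30)),
                x1 = rnorm(90, rep(c(0, 2, 4), each = 30)),
                x2 = rnorm(90))

fit <- lda(g ~ x1 + x2, data = d)

# Squared singular values give each discriminant's share of the separation
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 3)  # LD1 typically dominates, as in the plot above
```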
#Predicting classes with the LDA model
actual_crime_categories <- test_set$crime_category
test_set <- test_set[, -which(colnames(test_set) == "crime_category")]
predicted_classes <- predict(lda_model, newdata = test_set)$class
confusion_matrix <- table(actual = actual_crime_categories, predicted = predicted_classes)
print(confusion_matrix)
## predicted
## actual 1 2 3 4
## 1 9 10 4 0
## 2 2 17 6 0
## 3 1 9 13 2
## 4 0 0 1 27
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Overall Accuracy:", round(accuracy, 4), "\n")
## Overall Accuracy: 0.6535
The table above is a confusion matrix for the four crime-rate categories: rows are the actual classes, columns the predicted ones, so the diagonal holds the correctly classified instances. The overall accuracy is the sum of the diagonal divided by the total number of test observations. Here it is about 65%, which is moderate; most of the confusion occurs between the three lower categories, while category 4 is predicted well.
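The same matrix also yields per-class recall, which makes the "category 4 separates well" observation concrete. The values below are copied from the printed confusion matrix above:

```r
# Sketch: overall accuracy and per-class recall from the confusion matrix.
confusion_matrix <- matrix(c(9, 10,  4,  0,
                             2, 17,  6,  0,
                             1,  9, 13,  2,
                             0,  0,  1, 27),
                           nrow = 4, byrow = TRUE,
                           dimnames = list(actual = 1:4, predicted = 1:4))

overall <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
round(overall, 4)  # 0.6535, matching the output above

recall <- diag(confusion_matrix) / rowSums(confusion_matrix)
round(recall, 2)   # class 4 is classified far more reliably than classes 1-3
```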
#K-means algorithm
data(Boston)
scaled_data <- scale(Boston)
distances <- dist(scaled_data)
set.seed(123)
k_values <- 2:10
kmeans_results <- lapply(k_values, function(k) kmeans(scaled_data, centers = k))
wss <- sapply(kmeans_results, function(km) sum(km$withinss))
plot(k_values, wss, type = "b", pch = 19, frame = FALSE,
xlab = "Number of Clusters (k)", ylab = "Within-Cluster Sum of Squares (WSS)",
main = "Elbow Method for Optimal k")
optimal_k <- k_values[which.min(wss)]
optimal_kmeans <- kmeans(scaled_data, centers = optimal_k)
pairs(scaled_data, col = optimal_kmeans$cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
scaled_data_df <- as.data.frame(scaled_data)
pairs(scaled_data_df, col = optimal_kmeans$cluster)
fviz_cluster(optimal_kmeans, data = scaled_data_df, geom = "point",
ggtheme = theme_minimal(), ellipse.type = "convex", ellipse.level = 0.68)
First the variables in the Boston dataset are standardized. To calculate the distances between the observations based on the standardized variables we use dist(). The elbow method is used to identify a suitable number of clusters from the within-cluster sum of squares (WSS), and k-means is then run again with the chosen k. The data is visualized in the graphs above. Note, however, that WSS decreases as k grows, so which.min(wss) simply returns the largest k tried (here 10); the elbow should instead be read visually from the bend in the curve, which suggests a considerably smaller k. In the cluster plot we can see some clusters lying further from the others, but in general they overlap with their neighbours.
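The monotone behaviour of WSS is easy to demonstrate, which is why picking k with which.min is misleading. A quick sketch (nstart adds random restarts so the curve is essentially monotone):

```r
# Sketch: total within-cluster sum of squares shrinks as k grows,
# so the minimum is (essentially) always at the largest k tried.
library(MASS)
set.seed(123)
sdat <- scale(Boston)

wss <- sapply(1:10, function(k)
  kmeans(sdat, centers = k, nstart = 10)$tot.withinss)

diff(wss)  # successive drops shrink quickly after the first few k (the elbow)
```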
#Bonus I -Performing a k-means on the original data with >2 clusters.
library(MASS)
library(cluster)
data(Boston)
scaled_data <- scale(Boston)
set.seed(123)
num_clusters <- 3
kmeans_results <- kmeans(scaled_data, centers = num_clusters)
lda_model <- lda(factor(kmeans_results$cluster) ~ ., data = Boston)
lda_scores <- predict(lda_model)$x
plot(lda_scores[, 1], lda_scores[, 2], col = as.factor(kmeans_results$cluster), pch = 16,
main = "LDA Biplot-Like Visualization", xlab = "LD1", ylab = "LD2")
points(lda_model$means[, 1], lda_model$means[, 2], col = 1:num_clusters, pch = 3, cex = 2)
legend("topright", legend = 1:num_clusters, col = 1:num_clusters, pch = 3, title = "Clusters")
As we can see in the graph above, cluster 1 is clearly more detached than the others. Clusters 2 and 3 are more or less grouped together, but a fairly strong separation is still visible around the point LD2 = 0.
#Bonus II
Konrad Sopyllo
date()
## [1] "Wed Dec 6 17:47:39 2023"
To avoid errors I will use Kimmo's ready-made data.
library(readr)
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Data relations
rownames(human) <- human$Country
## Warning: Setting row names on a tibble is deprecated.
human$Country <- NULL
head(human)
## # A tibble: 6 × 8
## Edu2.FM Labo.FM Life.Exp Edu.Exp GNI Mat.Mor Ado.Birth Parli.F
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.01 0.891 81.6 17.5 64992 4 7.8 39.6
## 2 0.997 0.819 82.4 20.2 42261 6 12.1 30.5
## 3 0.983 0.825 83 15.8 56431 6 1.9 28.5
## 4 0.989 0.884 80.2 18.7 44025 5 5.1 38
## 5 0.969 0.829 81.6 17.9 45435 6 6.2 36.9
## 6 0.993 0.807 80.9 16.5 43919 7 3.8 36.9
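The deprecation warning above can be avoided by converting the tibble to a data.frame and using `tibble::column_to_rownames()` instead of assigning row names directly. A minimal sketch with two hypothetical stand-in rows (the real human data would slot in the same way):

```r
# Sketch: moving a Country column into row names without the
# "Setting row names on a tibble is deprecated" warning.
library(tibble)

human_df <- tibble(Country = c("Norway", "Australia"),
                   GNI = c(64992, 42261))  # stand-in for the full dataset

human_df <- column_to_rownames(as.data.frame(human_df), var = "Country")
head(human_df)
```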
library(ggplot2)
library(dplyr)
subset_vars <- c("Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
subset_data <- human[, subset_vars]
pairs(subset_data)
summary(subset_data)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
By looking at the data above we can see that expected years of education (Edu.Exp) and life expectancy (Life.Exp) show a positive linear trend: as Edu.Exp increases, so does Life.Exp. In other words, the higher the education level, the higher the life expectancy. No clear association is visible among the other variables.
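A visual trend like this can be quantified with a correlation coefficient. A sketch on synthetic columns (hypothetical stand-ins for Edu.Exp and Life.Exp; `cor(subset_data)` would give the full matrix for the real data):

```r
# Sketch: quantifying the positive trend seen in the scatterplot matrix.
# edu and life are synthetic stand-ins with a built-in linear relationship.
set.seed(1)
edu  <- rnorm(155, mean = 13, sd = 3)
life <- 40 + 2.4 * edu + rnorm(155, sd = 4)

cor(edu, life)               # strong positive correlation
cor.test(edu, life)$p.value  # significance of the correlation
```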
#Principal component analysis
pca_result <- prcomp(human[, c("Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")], scale = TRUE)
library(ggplot2)
library(ggrepel)
ggplot(data = as.data.frame(pca_result$x), aes(x = PC1, y = PC2)) +
geom_point(size = 3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray") +
geom_text_repel(aes(label = rownames(as.data.frame(pca_result$x))), box.padding = 0.5, max.overlaps = Inf) +
labs(title = "PCA Biplot") +
theme_minimal()
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("FactoMineR")
##
## The downloaded binary packages are in
## /var/folders/4g/x6rsz1195b5dtvzn9lvwxp1w0000gn/T//RtmpVQ2iTi/downloaded_packages
library(FactoMineR)
human_standardized <- scale(human)
pca_result_standardized <- PCA(human_standardized, graph = FALSE)
plot(pca_result_standardized, choix = 'ind', cex = 0.8, col.hab = as.factor(rownames(human)))
The biplot shows the relationship between countries based on PC1 and PC2. The arrows represent the original variables and their correlations; countries close to each other show a similar pattern of variation.
Without standardization, variables with a larger standard deviation (such as GNI) dominate the principal components; after standardization each variable gets equal weight. Overall, standardization is beneficial and gives a clearer picture of the data.
Variables pointing in the direction of PC1 have a strong influence on its variation, and the arrow length indicates the strength of each variable's contribution. Countries to the left of the PC1 axis have a lower score on it. Likewise, variables pointing in the direction of PC2 have a strong influence on that component.
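How much of the total variation PC1 and PC2 actually capture can be computed from the prcomp output. A sketch using the Boston data as a stand-in (the same two lines apply to pca_result on the human data):

```r
# Sketch: share of total variance explained by each principal component.
library(MASS)

pca <- prcomp(Boston, scale. = TRUE)  # scale. = TRUE standardizes columns

var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 3)           # individual shares of PC1 and PC2
round(cumsum(var_explained)[1:2], 3)   # cumulative share of PC1 + PC2
```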
install.packages(c("FactoMineR", "factoextra"))
##
## The downloaded binary packages are in
## /var/folders/4g/x6rsz1195b5dtvzn9lvwxp1w0000gn/T//RtmpVQ2iTi/downloaded_packages
library(FactoMineR)
library(factoextra)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
tea_mca_data <- tea[, c("breakfast", "tea.time", "evening", "lunch", "dinner", "always", "home", "work",
"tearoom", "friends", "resto", "pub", "Tea", "How", "sugar", "how", "where",
"price", "age", "sex", "SPC", "Sport", "age_Q", "frequency", "escape.exoticism",
"spirituality", "healthy", "diuretic", "friendliness", "iron.absorption",
"feminine", "sophisticated", "slimming", "exciting", "relaxing",
"effect.on.health")]
tea_mca_data[] <- lapply(tea_mca_data, as.factor)
mca_result <- MCA(tea_mca_data)
## Warning: ggrepel: 29 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
fviz_mca_biplot(mca_result, col.ind = "cos2", col.ind.sup = "blue",
cex = 0.7, title = "MCA Variable Biplot")
## Warning: Duplicated aesthetics after name standardisation: size
## Warning: Duplicated aesthetics after name standardisation: size
plot.MCA(mca_result, invisible = "ind", col.ind = "cos2", col.ind.sup = "blue",
cex = 0.5, title = "MCA Variable Biplot")
As shown in the biplot, variable categories plotted close to each other tend to be associated with one another.
###Assignment 6
date()
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(MASS)
library(corrplot)
library(GGally)
library(tibble)
library(FactoMineR)
library(factoextra)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
rats <- read.table("data/rats.txt", sep = ",", header = T)
str(rats)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ time : int 1 1 1 1 1 1 1 1 1 1 ...
Changing the categorical variables ID, Group and WD
rats$ID <- factor(rats$ID)
rats$Group <- factor(rats$Group)
rats$WD <- factor(rats$WD)
str(rats)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ time : int 1 1 1 1 1 1 1 1 1 1 ...
The data compares three different diets (groups) and their effect on the rats' development, i.e. body mass. The weight has been measured over 9 weeks at 11 different time points.
Weight development over time in different groups
ggplot(rats, aes(x = time, y = weight, linetype=ID, col=Group)) +geom_line() + scale_linetype_manual(values = rep(1:10, times=4)) + theme(legend.position = "none") + scale_y_continuous(limits = c(min(rats$weight), max(rats$weight))) + facet_grid(. ~ Group, labeller = label_both) + xlab("time (days)")
Plotting the standardized values shows the distribution better, especially at the beginning of the follow-up.
rats <- rats %>% group_by(time) %>% mutate(stdweight = (weight-mean(weight))/sd(weight)) %>% ungroup()
ggplot(rats, aes(x = time, y = stdweight, linetype=ID, col=Group)) +geom_line() + scale_linetype_manual(values = rep(1:10, times=4)) + theme(legend.position = "none") + facet_grid(. ~ Group, labeller = label_both) + xlab("time (days)") + ylab("scaled weight")
As we can see, Group 1 has a much lower weight from the very beginning and shows only a minimal increase throughout the observation period. Groups 2 and 3 start heavier and gain more weight over the observation period; Group 2, however, contains one clearly heavier outlier.
rats2 <- rats %>% group_by(Group, time) %>% summarise( mean = mean(weight), se = (sd(weight)/sqrt(length(weight)))) %>% ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
#average weight change over the observation period
ggplot(rats2, aes(x = time, y = mean, linetype = Group, shape = Group, col=Group)) + geom_line() + scale_linetype_manual(values = c(1,2,3)) + geom_point(size=3) + scale_shape_manual(values = c(1,2,3)) + geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) + scale_y_continuous(name = "mean(weight) +/- se(weight)") + scale_x_continuous(name = "time (days)")
#total change in weight
rats3 <- rats %>% filter(time > 1) %>% group_by(Group, ID) %>% summarise(mean=mean(weight)) %>% ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(rats3, aes(x = Group, y = mean, col=Group)) + geom_boxplot() + stat_summary(fun = "mean", geom = "point", shape=23, size=2.5, fill = "white") + scale_y_continuous(name = "mean(weight), days 8-64")
rats4 <- subset(rats3, rats3$mean<550)
ggplot(rats4, aes(x = Group, y = mean, col=Group)) + geom_boxplot() + stat_summary(fun = "mean", geom = "point", shape=23, size=2.5, fill = "white") + scale_y_continuous(name = "mean(weight), days 8-64")
Group 1 has the smallest weight. Group 2 has the biggest increase in weight gain. Group 3 have the biggest weight however the gain is not as big as in group 2.
##T-tests to evaluate the group differences and their suitability for the linear model
rats5 <- subset(rats4, rats4$Group==1 | rats4$Group==2)
t.test(mean ~ Group, data = rats5, var.equal=TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -25.399, df = 9, p-value = 1.094e-09
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -204.0633 -170.6867
## sample estimates:
## mean in group 1 mean in group 2
## 265.025 452.400
#Group 1 versus Group 2
rats5 <- subset(rats4, rats4$Group==1 | rats4$Group==3)
t.test(mean ~ Group, data = rats5, var.equal=TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -27.824, df = 10, p-value = 8.345e-11
## alternative hypothesis: true difference in means between group 1 and group 3 is not equal to 0
## 95 percent confidence interval:
## -283.4943 -241.4557
## sample estimates:
## mean in group 1 mean in group 3
## 265.025 527.500
#Group 1 versus Group 3
rats5 <- subset(rats4, rats4$Group==2 | rats4$Group==3)
t.test(mean ~ Group, data = rats5, var.equal=TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -5.632, df = 5, p-value = 0.002446
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
## -109.37769 -40.82231
## sample estimates:
## mean in group 2 mean in group 3
## 452.4 527.5
#Group 2 versus Group 3
rats5 <- subset(rats3, rats3$Group==2 | rats3$Group==3) # rats3 keeps the outlier; using rats4 here would mix data frames of different lengths
t.test(mean ~ Group, data = rats5, var.equal=TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -0.83974, df = 5, p-value = 0.4393
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
## -149.45079 75.85079
## sample estimates:
## mean in group 2 mean in group 3
## 487.8 524.6
baseline <- subset(rats, rats$time==1)
rats5 <- rats3 %>% mutate(baseline = baseline$weight)
fit <- lm(mean ~ baseline + Group, data = rats5)
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing Group 1 to Groups 2 and 3 gives very small p-values, so there is a significant difference in weight. Comparing Groups 2 and 3 already gives a larger p-value (0.0024 with the outlier removed), and keeping the outlier raises it further (0.44), calling the difference between these two groups into question.
Based on the ANOVA, the baseline weight is by far the strongest predictor, so it would be advisable to choose rats with a similar starting weight (or to adjust for baseline, as done here).
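Running three separate t-tests inflates the type I error rate; `pairwise.t.test()` does all comparisons in one call with a multiplicity correction. A sketch on synthetic group summaries (hypothetical stand-ins with roughly the group means seen above; rats3 would slot in directly):

```r
# Sketch: all pairwise group comparisons with a Bonferroni correction.
# d is a hypothetical stand-in for the rats3 summary data.
set.seed(1)
d <- data.frame(
  Group = factor(rep(1:3, times = c(8, 4, 4))),
  mean  = c(rnorm(8, 265, 10), rnorm(4, 452, 15), rnorm(4, 527, 20))
)

ptt <- pairwise.t.test(d$mean, d$Group, p.adjust.method = "bonferroni")
ptt  # adjusted p-values for 1 vs 2, 1 vs 3 and 2 vs 3
```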
#BPRS
bprs <- read.table("data/bprs.txt", header=T, sep = ",")
str(bprs)
## 'data.frame': 360 obs. of 4 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dose : int 42 58 54 55 72 48 71 30 41 57 ...
bprs$subject <- 1:40 # recycled over the 360 rows: gives each of the 40 subjects an ID that is unique across the two treatments
bprs$treatment <- factor(bprs$treatment)
bprs$subject <- factor(bprs$subject)
bprs$week <- factor(bprs$week)
str(bprs)
## 'data.frame': 360 obs. of 4 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 40 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ week : Factor w/ 9 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ dose : int 42 58 54 55 72 48 71 30 41 57 ...